Standard exact clause detection

ABSTRACT

Embodiments relate to a system and a method for identifying, from contractual documents, (i) standard exact clauses matching clause examples and (ii) non-standard clauses semantically related to but not matching the clause examples. A standard feature data set comprising standard exact clauses matching clause examples is obtained. In addition, a mirror feature data set comprising semantically related clauses of the clause examples is obtained using semantic language analysis, where the mirror feature data set encompasses the standard feature data set. Non-standard clauses are obtained by extracting a difference between the mirror feature data set and the standard exact feature data set.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a reissue of U.S. Pat. No. 10,185,712, which was filed as U.S. application Ser. No. 15/723,023 on Oct. 2, 2017, which is a continuation of U.S. application Ser. No. 14/797,959, filed Jul. 13, 2015, now U.S. Pat. No. 9,805,025, which are incorporated herein by reference in their entirety. More than one reissue application has been filed for the reissue of U.S. Pat. No. 10,185,712. The reissue applications are application Ser. Nos. 17/086,288 (the present application) filed Oct. 30, 2020 and 17/588,656 filed Jan. 31, 2022, which is both a continuation reissue of application Ser. No. 17/086,288 and a reissue of application Ser. No. 15/723,023.

BACKGROUND

1. Field of Art

The disclosure generally relates to the field of natural language processing, and in particular, to identifying and extracting information from documents.

2. Description of the Related Art

A contract is a document that defines legally enforceable agreements between two or more parties. During the negotiation process, parties to the contract often agree to make multiple amendments or addendums, and these amendments or addendums can be stored in random formats in different locations.

Frequent changes in contracts often present challenges to conventional approaches for finding contracts and amendments, as conventional approaches typically focus on the unstructured text only and are not able to extract relevant and important information correctly. For example, a contract and amendments may include clauses that contain wording such as “net 30 days,” “within 30 days,” “30 day's notice,” and “2% penalty.” On the other hand, one of the amendments may include non-standard clauses such as “5 working days” with “60% penalty.” Without the ability to discover the clauses and types of the clauses accounting for their semantic variations, any party not keeping track of the amendments or the addendums is vulnerable to a significant risk of overlooking unusual contractual terminologies.

BRIEF DESCRIPTION OF DRAWINGS

The disclosed embodiments have advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.

FIG. 1 illustrates one embodiment of a standard exact clause detection system for a contractual document.

FIG. 2 illustrates an input processor of the standard exact clause detection system configured to process input data.

FIG. 3 illustrates a discovery engine of the standard exact clause detection system to properly structure and to normalize the input data.

FIG. 4 illustrates a representation of data stored as discrete database documents with different indexes.

FIG. 5 illustrates an analysis engine of the standard exact clause detection system to define standard exact clauses in contractual documents.

FIG. 6 illustrates a flow chart of a method of obtaining standard exact clauses and non-standard clauses.

FIG. 7 illustrates a process for determining a policy to specify clauses for extraction.

FIG. 8 illustrates components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller).

DETAILED DESCRIPTION

The Figures (FIGS.) and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.

Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.

Configuration Overview

One embodiment of a disclosed configuration is a system (or a method or a non-transitory computer readable medium) for identifying standard exact clauses and non-standard clauses used in contractual documents. A standard exact clause herein refers to a clause including words and an order of words matching those of a predefined clause example. A non-standard clause herein refers to a clause semantically related to a predefined clause example, but including words or an order of words different from those of the predefined clause example. By identifying standard exact clauses and non-standard clauses from a corpus of contractual documents, exact clauses and semantically related clauses can be identified promptly to improve the contract review process. It is noted that although described in a context of contracts, the principles described herein can apply to other structured documents.

In one embodiment, the system includes an input processor to configure raw input data into a format that can be structurally analyzed by a discovery engine. The discovery engine generates a predefined policy to be applied in a search engine. With the predefined policy, the discovery engine prepares initial search results to allow an administrator to select items to build and test a new custom policy along with all the predefined policies in a format that can be viewed by an end user. In the analysis engine, the end user can view the initial search results, and also customize the predefined policy to define a primary policy. With the primary policy, the analysis engine and the semantic language evaluator perform semantic language analysis, and first determine the standard clauses. Among the standard clauses, standard exact clauses with words and an order of the words exactly matching clause examples are identified. Furthermore, the analysis engine and the semantic language evaluator perform another semantic language analysis with a less restrictive secondary policy to extract the non-standard clauses.

Non-Standard and Standard Clause Detection System

FIG. 1 illustrates one exemplary embodiment of a standard exact clause detection system 10 including one or more input processors (generally an input processor 110), a discovery engine 120, an analysis engine 130, a semantic language evaluator 140, and an analysis database 150. Each of these components may be embodied as hardware, software, firmware or a combination thereof. In various embodiments, engines or modules include software (or firmware) structured to operate with processing components of a computing system to form a machine that operates as described herein for the corresponding engines or modules. Further, two or more engines may interoperate to form a machine that operates as described herein. Examples of the processing components of the computing system are described with respect to FIG. 8. The system 10 also comprises a discovery database 160 to store data for identifying standard exact clauses and non-standard clauses.

As illustrated in FIG. 1, the input processor 110 aggregates one or more raw data 100(0), 100(1) . . . 100(N) (generally 100) and processes them in an appropriate format. Also, the discovery engine 120 is communicatively coupled to the input processor 110. In addition, the analysis engine 130 is coupled to the discovery engine 120. The discovery engine 120 develops a predefined policy and initial search results. Additionally, the analysis engine 130 performs a semantic language analysis by applying policies to the semantic language evaluator 140, and determines the non-standard clauses, standard clauses, and standard exact clauses used in the raw data 100. Throughout the process, the discovery database 160 stores the initial search results, metadata, and the predefined policy. In addition, the discovery database 160 is communicatively coupled to the input processor 110, the discovery engine 120, and the analysis engine 130. Additionally, the analysis engine 130 is coupled to the analysis database 150 and stores information for performing semantic language evaluation. In one embodiment, the discovery database 160 and the analysis database 150 can be combined into one database.

FIG. 2 illustrates an exemplary embodiment of an input processor 110 that may aggregate the raw data 100 and refine them into acceptable formats in the following stages. As shown in FIG. 2, the input processor 110 includes a file import system module 212, a correction module 213, and a format standardization module 214.

The file import system module 212 receives the raw data 100 from any one of file systems, emails, Content Management Systems and physical document scanning devices. The file import system module 212 also detects potential contracts and checks if any duplicates of documents exist in the discovery database 160 already. In addition, the file import system module 212 can convert a physical document into another electronic format, for example Portable Document Format (PDF), Microsoft Office format, Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), etc. Moreover, the file import system module 212 may include an image file processor module with an optical character recognition (OCR) engine 218. The OCR engine 218 may be an ABBYY FineReader engine or a standard iFilter OCR engine. It is to be noted that other types of OCR engine or any combinations of OCR engines may be implemented. Furthermore, the file import system module 212 detects the language of the contractual document and how many words exist within. In one aspect, the OCR engine 218 of the file import system module 212 determines a quality of the OCR performed for each character or each word, and generates a quality score indicating a quality of the OCR performed for each character or each word.

The correction module 213 in the input processor 110 receives the data imported from the file import system module 212. The correction module 213 also is configured to apply typographical corrections or OCR corrections.

In an exemplary embodiment, the format standardization module 214 tailors the format of the data imported from the file import system module 212 for further processing. The format standardization module 214 applies filters to extract textual information. In addition, the input processor 110 may remove passwords to access a protected contractual document only when the owners of the documents agree to remove such passwords. Furthermore, the format standardization module 214 includes a file protection function that creates copies of potential contractual documents identified. These identified contractual documents are stored in the discovery database 160 with security access attributes.

Next, FIG. 3 illustrates an embodiment of the discovery engine 120 that structurally analyzes input data from the input processor 110 and generates the predefined policy. The predefined policy includes, but is not limited to, predefined rules, predefined features, and predefined clause examples.

The discovery engine 120 also applies the predefined policy into the search engine (not shown) and prepares initial search results along with the predefined policy and metadata in a format that allows the end user to view. As shown in FIG. 3, the discovery engine 120 includes a pre-normalization module 321, a language detection module 322, a processing queue module 323, a structuration function module 324, a rules processing module 325, a post processing and reduction module 326, and a high level processing module 327.

The pre-normalization module 321 receives the imported data in the standardized format obtained from the input processor 110, and converts the imported data into a standard XML or HyperText Markup Language (HTML) document. Also, the language detection module 322 can identify the language used in the XML or HTML converted document (e.g., English, German, etc.), and place the document in the processing queue module 323.

Once the XML or HTML converted document is out of the processing queue module 323, the structuration function module 324 structurally analyzes the XML or HTML converted document into a plurality of hierarchical levels. In FIG. 4, illustrated is a representation of data stored as discrete database documents: a sentence level 401, a paragraph level 402, a section level 403, and a document level 404. Analyzing the documents or data in the structure mentioned above allows locating of terminologies and clauses used in the contractual documents.

Referring back to FIG. 3, following the structuration function module 324 is the rules processing module 325. In this stage, the discovery engine 120 applies the predefined rules to generate the predefined features. The predefined rules determine the logic or sequence of words, sentences, phrases, NLP (natural language processing) features, or terminologies. In addition, the rules processing module 325 generates the predefined features from the predefined rules for the end user to customize in the analysis engine 130. The predefined features can be a key reference or a descriptive verb that can describe the document and the information held within. For instance, the predefined features can be a start date, a termination date, a contract type, etc.

In addition, the post processing and reduction module 326 reduces and normalizes the predefined features from the rules processing module 325. It is to be noted that in addition to sentence and paragraph boundaries, the discovery engine 120 can identify contractual section boundaries such as termination, limitation of liability, and indemnity sections of a contract. Moreover, the post processing and reduction module 326 prepares the predefined features for the end user to customize in the analysis engine 130.

Normalization in the post processing and reduction module 326 reduces the common notations into a standard format. For instance, the same date can be expressed in multiple ways (e.g., Oct. 23, 1992, 10/23/1992, 23/10/1992, 1992/10/23, etc.), and the normalization can convert the various formats into a standard ISO format. Normalizing to the standard format can eliminate confusion and improve processing speed. Most importantly, by consolidating into the same notation, the discovery engine 120 can reduce duplicate terms that appear in different formats.
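As an illustration only, the date normalization described above could be sketched in Python as follows. The list of accepted layouts, the function name normalize_date, and the rule that an ambiguous numeric date resolves to the first layout that parses are all assumptions made for this sketch; the disclosure does not specify them.

```python
from datetime import datetime

# Hypothetical list of accepted input layouts, tried in order.
# An ambiguous numeric date (e.g. 03/04/1992) resolves to the first layout that parses.
KNOWN_FORMATS = ["%b. %d, %Y", "%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d"]

def normalize_date(text):
    """Return the date in ISO 8601 form (YYYY-MM-DD), or None if no layout matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this layout does not fit; try the next one
    return None

# The notations listed in the text collapse to a single normalized value.
variants = ["Oct. 23, 1992", "10/23/1992", "23/10/1992", "1992/10/23"]
normalized = {normalize_date(v) for v in variants}
```

Collecting the results in a set shows the consolidation effect: four surface forms become one stored term.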

After the feature creation and normalization, the high level processing module 327 creates metadata and stores them in the discovery database 160. Additionally, the search engine communicatively coupled to the discovery database 160 obtains initial search results to determine the eligibility for analytics processing. Moreover, the high level processing module 327 prepares the predefined policy as well as the initial search results in a format that the end user can view. Furthermore, either one or both of an internal search engine and an external search engine may perform a search function.

Referring now to FIG. 5, illustrated is one embodiment of the analysis engine 130, which identifies standard exact clauses, standard clauses, and non-standard clauses. As illustrated, the analysis engine 130 includes an analysis engine queue module 531, a variable detection module 570, a custom feature generation module 532, a document parsing module 533, a policy definition module 534, a standard clause detection module 535, a standard exact clause detection module 536, a non-standard clause detection module 537, and an update discovery database module 538.

The discovery engine 120 transfers a data set including the predefined policy, search indexes, and the initial search results to the analysis engine queue module 531. Following the analysis engine queue module 531, the custom feature generation module 532 allows the end user to customize the predefined features obtained from the discovery engine 120 and to define primary features.

The variable detection module 570 receives search indexes or the initial search results and provides available variations of clauses to the custom feature generation module 532. The variable detection module 570 may receive the search indexes or the initial search results from the discovery engine 120 directly or from the analysis engine queue module 531. The variable detection module 570 may detect allowable variations of clauses according to examples stored in the discovery engine 120 and provide the detected allowable variations of clauses with associated variables to the custom feature generation module 532.

The custom feature generation module 532 receives the predefined features from the analysis engine queue module 531 to define primary features to be used in semantic language evaluation. The custom feature generation module 532 may also receive detected allowable variations from the variable detection module 570 to define the primary features. In one approach, the custom feature generation module 532 presents to a user a list of clauses or features within a template. The user may select which clauses are to be considered as standard clauses. In addition, the user may select which clauses or words in the standard clauses can be varied. In one approach, the user may assign a variable to each set of selected clauses or words allowed to be varied. Alternatively, the custom feature generation module 532 may assign a variable to a set of clauses or words allowed to be changed. The custom feature generation module 532 provides the primary features comprising selected clause examples and variables associated with allowable variations to a document parsing module 533.

Following is an example passage of a document with clause examples replaced with associated variables.

EXAMPLE 1

“absText”: “WatchtowerNumber. Code of Conduct. WatchtowerPartyDescriptors acknowledges the terms of WatchtowerLocation WatchtowerPartyDescriptors Code of Business Conduct\nand Ethics WatchtowerPartySubjectVerb WatchtowerPartyDescriptors (i) that all of WatchtowerPartyDescriptors dealings with WatchtowerPartyDescriptors WatchtowerLocation, whether pursuant to\nthis Agreement or otherwise, shall be in general alignment with the requirements of the Code, and (ii)\nnot to induce or otherwise cause any WatchtowerPartyDescriptors WatchtowerLocation associate to violate the Code, with Code Number WatchtowerNumber. Should this be violated, the WatchtowerContractingParties agree to pay WatchtowerSealMoney within WatchtowerDuration.\n”, “offsetStart”: 12704, “offsetEnd”: 13085.

In the example passage above, various clauses are replaced with corresponding variables. Specifically, variations of a contract number, a party involved in the contract, another party involved in the contract, a specific location, a specific act, an amount, and a duration can be replaced with the variables “WatchtowerNumber,” “WatchtowerPartyDescriptors,” “WatchtowerContractingParties,” “WatchtowerLocation,” “WatchtowerPartySubjectVerb,” “WatchtowerSealMoney,” and “WatchtowerDuration,” respectively.

With the user defined primary features, the document parsing module 533 replaces the actual text, phrases or clauses with the primary features. In one embodiment, the document parsing module 533 replaces words or clauses with allowed variations with corresponding variables. The semantic language evaluator 140, formed with the primary-feature-replaced data set, ensures the accuracy and quality of the data. That is, the semantic language evaluator 140 accounts for minor anomalies within the clauses, allowing the analysis engine 130 to locate and group clauses based on the core semantics. The document parsing module 533 transfers clause examples to the semantic language evaluator 140, and the semantic language evaluator 140 assesses the similarity to each of the examples. In one exemplary embodiment, the semantic language evaluator 140 may be a Latent Semantic Indexing (LSI) module, which may provide a cosine vector score based on the similarity and classify clauses accordingly. For instance, a cosine vector score of 1 indicates a high degree of similarity, whereas a score of 0 indicates a low degree of similarity.
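To make the cosine vector score concrete, the following Python sketch computes a cosine score between two clauses from plain term-count vectors. This is a deliberately simplified stand-in, not the LSI module itself: it skips the dimensionality reduction that LSI performs, and the function name cosine_score and the sample clauses are hypothetical.

```python
import math
from collections import Counter

def cosine_score(text_a, text_b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two clauses share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

example = "payment is due within WatchtowerDuration"
same = cosine_score(example, "payment is due within WatchtowerDuration")
unrelated = cosine_score(example, "governing law shall be Delaware")
```

Because the vectors only count terms, two clauses with the same words in a different order receive the same score, which is one reason a separate order-sensitive comparison is useful for exact matching.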

The policy definition module 534 allows the end user to define the primary policy that includes primary rules, primary features or clause examples (herein also referred to as “primary clause examples”) and a first threshold. In one exemplary embodiment, a recommended value for the first threshold is ‘95’ or between ‘90’ and ‘99,’ when the semantic language evaluator is the LSI module.

The standard clause detection module 535 obtains standard clauses based on the primary policy. In one implementation, the standard clause detection module 535 applies the primary policy with the first threshold to the semantic language evaluator 140 to obtain the standard clauses. The primary policy with the first threshold allows the analysis engine 130 to locate clauses that are almost identical to the primary clause examples. The standard clause detection module 535 may provide a standard feature data set comprising the standard clauses to the custom feature generation module 532. The custom feature generation module 532 may modify clause examples based on the standard clauses or present the standard clauses detected to a user to allow a list of clause examples or allowable variations of clauses to be re-selected. The standard clause detection module 535 may also store the standard feature data set in the discovery database 160.

The standard exact clause detection module 536 obtains the standard exact feature data set comprising standard exact clauses based on the clause examples. In one embodiment, the standard exact clause detection module 536, instead of the document parsing module 533, replaces words or clauses allowed to be changed with corresponding variables. The standard exact clause detection module 536 compares each word and an order of words from a document with each word and an order of words from clause examples to obtain standard exact clauses exactly matching the clause examples. The textual matching is performed word by word, and in this example a word can be seen as a token. A token can be made from any contiguous textual items, such as numbers, text, or symbols. Each token is compared against the clause examples provided within the primary policy, in the exact word order in which it appears within the clause, with the system rejecting an item as soon as the first token is found not to match. In one implementation, the LSI module may not consider an order of the words, thus the standard exact clause detection module 536 obtains N-grams of different words or tokens to compare an order of words. By replacing words or clauses allowed to be changed with their corresponding variables, the standard exact clause detection module 536 can reduce the number of comparisons performed to identify standard exact clauses while accounting for each variation of the clause examples.
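The token-by-token comparison with early rejection can be sketched in Python as below. The VARIATIONS table, the whitespace tokenizer, and the function names to_mirror and is_standard_exact are assumptions for illustration; the disclosure does not prescribe a tokenizer or a data layout.

```python
import re

# Hypothetical allowable variations, keyed by the variable that replaces them.
VARIATIONS = {
    "WatchtowerDuration": ["30 days", "thirty days", "5 working days"],
    "WatchtowerSealMoney": ["$500", "$1,000"],
}

def to_mirror(text):
    """Replace each allowable variation with its variable name."""
    for variable, phrases in VARIATIONS.items():
        # Longest phrases first so a short phrase never splits a longer one.
        for phrase in sorted(phrases, key=len, reverse=True):
            text = re.sub(re.escape(phrase), variable, text, flags=re.IGNORECASE)
    return text

def is_standard_exact(clause, example):
    """Compare tokens in order, rejecting at the first mismatch."""
    clause_tokens = to_mirror(clause).split()
    example_tokens = example.split()
    if len(clause_tokens) != len(example_tokens):
        return False
    for got, expected in zip(clause_tokens, example_tokens):
        if got != expected:
            return False  # early rejection on the first non-matching token
    return True

EXAMPLE = "The parties agree to pay WatchtowerSealMoney within WatchtowerDuration."
exact = is_standard_exact("The parties agree to pay $500 within 30 days.", EXAMPLE)
reordered = is_standard_exact("The parties agree to pay within 30 days $500.", EXAMPLE)
```

One clause example with two variables covers every combination of amount and duration in the table, so the comparison runs once per clause rather than once per variation.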

In one embodiment, the standard exact clause detection module 536 also identifies a candidate standard exact clause including an obscure word (or a character of the word) with poor optical character recognition based on the quality score provided from the OCR engine 218. Responsive to determining the quality of the OCR performed on the obscure word is poor (e.g., the quality score of the obscure word is below a quality threshold value), the standard exact clause detection module 536 determines whether qualities of the OCR performed on a preceding word and a succeeding word of the obscure word are acceptable. If the qualities of the OCR performed on the preceding word and the succeeding word are acceptable, the standard exact clause detection module 536 determines whether any of the clause examples and the variables include the preceding word, a candidate word, and the succeeding word in that sequence. If a clause example including the preceding word, the candidate word, and the succeeding word in that sequence is found, a clause including the preceding word, the obscure word, and the succeeding word is determined to be a candidate standard exact clause. The standard exact clause detection module 536 may add the candidate standard exact clause to the standard exact feature data set.
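The neighbor-based check for obscure words can be sketched in Python as follows. The quality threshold of 0.8, the 0-to-1 score scale, the (word, score) pair layout, and the function name find_candidate_words are all assumptions; the disclosure states only that a quality score is compared against a quality threshold value.

```python
QUALITY_THRESHOLD = 0.8  # hypothetical cut-off; the source does not give a value

def find_candidate_words(ocr_words, example_tokens):
    """ocr_words: list of (word, quality_score). Returns {index: candidate_word}."""
    candidates = {}
    for i in range(1, len(ocr_words) - 1):
        word, score = ocr_words[i]
        if score >= QUALITY_THRESHOLD:
            continue  # word was recognized well enough
        (prev_word, prev_score), (next_word, next_score) = ocr_words[i - 1], ocr_words[i + 1]
        if prev_score < QUALITY_THRESHOLD or next_score < QUALITY_THRESHOLD:
            continue  # the neighbors are not reliable either
        # Look for prev, <any word>, next as a consecutive sequence in the example.
        for j in range(1, len(example_tokens) - 1):
            if example_tokens[j - 1] == prev_word and example_tokens[j + 1] == next_word:
                candidates[i] = example_tokens[j]
                break
    return candidates

example = "payment is due within thirty days".split()
scanned = [("payment", 0.97), ("is", 0.99), ("dne", 0.41), ("within", 0.95),
           ("thirty", 0.96), ("days", 0.98)]
result = find_candidate_words(scanned, example)
```

Here the poorly recognized token at index 2 is bracketed by two well-recognized words that appear around "due" in the clause example, so the clause would be flagged as a candidate standard exact clause.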

The standard exact clause detection module 536 may provide the standard exact feature data set to the custom feature generation module 532. The custom feature generation module 532 may modify clause examples based on the standard exact clauses or candidate standard exact clauses. The custom feature generation module 532 may also present the standard exact clauses (or candidate standard exact clauses) detected to a user to allow a list of clause examples or allowable variations of clauses to be re-selected. The standard exact clause detection module 536 may also store the standard exact feature data set in the discovery database 160.

The non-standard clause detection module 537 may create a secondary policy, which is a copy of the primary policy that does not contain any rules, but includes a second threshold lower than the first threshold. In one exemplary embodiment, a recommended value for the second threshold is ‘60’ or between ‘50’ and ‘70,’ when the semantic language evaluator 140 is the LSI module. In addition, the non-standard clause detection module 537 extracts a mirror feature data set with the secondary policy. The secondary policy allows the analysis engine 130 to locate all clauses that are semantically similar to the primary search examples. It is to be noted that the mirror feature data set not only contains more data, but also contains the exact matches from the standard feature data set. That is, the mirror feature data set encompasses the standard feature data set, where the standard feature data set encompasses the standard exact feature data set.

In one embodiment, the non-standard clause detection module 537 subtracts the standard exact feature data set from the mirror feature data set to obtain the non-standard clauses. In this embodiment, standard clauses that are not standard exact clauses would be identified as non-standard clauses.

In another embodiment, the non-standard clause detection module 537 subtracts the standard feature data set from the mirror feature data set to obtain the non-standard clauses. In this embodiment, the non-standard clause detection module 537 may obtain the non-standard clauses after the standard clauses are obtained in the standard clause detection module 535 but before the standard exact clauses are obtained in the standard exact clause detection module 536. Alternatively, the non-standard clause detection module 537 can obtain the non-standard clauses after the standard exact clauses are obtained in the standard exact clause detection module 536.
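The relationship between the three data sets, and the two subtraction embodiments, can be illustrated with Python sets. The clause identifiers, scores, and exact-match flags are invented for the sketch, as are the assumptions that scores lie on a 0-100 scale and that a clause qualifies when its score is greater than or equal to the threshold.

```python
FIRST_THRESHOLD = 95   # recommended value for the primary policy in the text
SECOND_THRESHOLD = 60  # recommended value for the secondary policy in the text

# Hypothetical clause ids mapped to (similarity score, exact-match flag).
scored = {
    "c1": (100, True),
    "c2": (97, False),
    "c3": (72, False),
    "c4": (31, False),
}

standard = {cid for cid, (score, _) in scored.items() if score >= FIRST_THRESHOLD}
standard_exact = {cid for cid in standard if scored[cid][1]}
mirror = {cid for cid, (score, _) in scored.items() if score >= SECOND_THRESHOLD}

# First embodiment: near-identical clauses that are not exact count as non-standard.
non_standard_vs_exact = mirror - standard_exact
# Second embodiment: only clauses below the first threshold count as non-standard.
non_standard_vs_standard = mirror - standard
```

In this sample, clause c2 is the one whose classification depends on the embodiment: it is non-standard under the first subtraction and standard under the second.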

Once the analysis engine 130 obtains the standard clauses, standard exact clauses and non-standard clauses, the update discovery database module 538 may update the discovery database 160 with the standard clauses, standard exact clauses and the non-standard clauses obtained.

Standard Exact Clause and Non-Standard Clause Detection Process

FIG. 6 illustrates a process of obtaining standard exact clauses and non-standard clauses. The process may be performed by the variable detection module 570, policy definition module 534, standard clause detection module 535, standard exact clause detection module 536, and non-standard clause detection module 537 of the analysis engine 130. In other embodiments, the steps of FIG. 6 may be performed by different or additional components. Other embodiments can perform the steps of FIG. 6 in different orders. Moreover, other embodiments can include different and/or additional steps than the ones described here.

The variable detection module 570 receives 610 an input document. The variable detection module 570 obtains 620 allowable variations of standard clauses and a corresponding variable for the variations. The policy definition module 534 obtains 630 the primary policy including the primary rules, the primary features, the primary clause examples and the first threshold for determining similarities. The policy definition module 534 obtains 640 the secondary policy, which is a copy of the primary policy that does not contain any rules but includes a second threshold lower than the first threshold. In one embodiment, the primary policy, the secondary policy and the allowable variations of standard clauses may be obtained in different orders.

The standard clause detection module 535 obtains 650 a standard feature data set comprising standard clauses based on the primary policy from the input document. The standard exact clause detection module 536 generates 660 a mirror document by replacing allowable variations with corresponding variables, and obtains 670 a standard exact feature data set comprising standard exact clauses exactly matching the clause examples from the mirror document. Moreover, the non-standard clause detection module 537 obtains 680 a mirror feature data set comprising related clauses based on the secondary policy from the input document. Furthermore, the non-standard clause detection module 537 obtains 690 a difference between the mirror feature data set and the standard exact feature data set to obtain non-standard clauses.

Policy Definition Process

FIG. 7 illustrates an example process of determining the policy. The primary policy provides guidance on how and where to look for contract specific terminologies. Again, the primary policy may include the primary rules, the primary features, the primary clause examples and the first threshold for determining similarities. Other embodiments can perform the steps of FIG. 7 in different orders. The steps of FIG. 7 may be performed by custom feature generation module 532, document parsing module 533, and policy definition module 534 of the analysis engine 130. In other embodiments, the steps of FIG. 7 may be performed by different or additional components (e.g., variable detection module 570). Moreover, other embodiments can include different and/or additional steps than the ones described here.

In this example, the discovery engine 120 provides a discovery search index to the analysis engine 130 to perform a clause example search 710, and presents the predefined clause examples to the end user. The end user may search for the primary clause examples in the clause selection 720, either under a section or a paragraph. If the end user decides to look for a clause under the section, the custom feature generation module 532 loads the feature replaced data in the section selection 721. In a find similar section 723, the document parsing module 533 requests the semantic language evaluator 140 to query if similar features exist already within the index. Likewise, if the end user decides to look for a clause under the paragraph, the custom feature generation module 532 loads the feature replaced data from the analysis database 150 in a paragraph selection 722. In a find similar paragraph 724, the document parsing module 533 requests the semantic language evaluator 140 to query if similar features exist already within the index.

The policy definition module 534 enables the end user to select the primary clause examples from the search results in a clause example selection 730. Additionally, the end user may repeat the clause selection 720, and select new clauses.

Following the clause example selection 730, the policy definition module 534 enables the end user to select the primary rules to determine the logic or sequence of words, sentences, phrases, or terminologies to be searched in a rule selection 740, and to evaluate the selected rule in a sentence rule evaluated 750. In addition, the end user may repeat the clause selection 720 to select new clauses to be applied or repeat the rule selection 740 to modify selected rules or add additional rules. The policy definition module 534 updates the primary policy as well as the analysis database 150 in the nested policy definition 760.

In one embodiment, the secondary policy may be generated based on the primary policy, or through steps similar to those described above.
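For illustration only, the relationship between a primary policy and a secondary policy derived from it can be sketched as follows. The `Policy` class, its field names, the `derive_secondary` helper, and the numeric threshold values are all hypothetical and are not taken from the disclosure; the sketch assumes only that a policy groups rules, features, clause examples and a similarity threshold, and that the secondary policy uses a lower threshold than the primary policy.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Policy:
    """Hypothetical container for the policy parts named in the text."""
    rules: Tuple[str, ...]            # logic or sequence of words to search
    features: Tuple[str, ...]         # feature names used during matching
    clause_examples: Tuple[str, ...]  # clause examples chosen by the end user
    threshold: float                  # similarity threshold, assumed in [0, 1]


def derive_secondary(primary: Policy, threshold: float) -> Policy:
    """Copy the primary policy, changing only the similarity threshold."""
    if not 0.0 <= threshold < primary.threshold:
        raise ValueError("secondary threshold must be below the primary one")
    return replace(primary, threshold=threshold)


# Illustrative values only; none of them come from the disclosure.
primary = Policy(
    rules=("payment within <DURATION>",),
    features=("DURATION",),
    clause_examples=("Payment is due within 30 days of invoice.",),
    threshold=0.9,
)
secondary = derive_secondary(primary, threshold=0.6)
```

In this sketch the secondary policy reuses the rules, features and clause examples of the primary policy unchanged, which is one possible reading of "generated based on the primary policy"; an embodiment could equally build it through a separate pass of the selection steps.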

Data Storage Method

Referring back to FIG. 4, illustrated is a representation of data stored as discrete database documents. To enable the detection of the standard exact clauses and non-standard clauses, each data set associated with a clause may contain a unique identification number or the following indexes: a sentence level 401, a paragraph level 402, a section level 403, and a document level 404. In addition to the identification number, each data set may include actual text or features replaced for the clause, and the position of the clause.
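As a rough sketch of such a data set, the record below carries an identifier, a level, the clause text, an optional feature replaced text, and a position. The class and field names are hypothetical, the position is assumed to be a character offset, and reusing the reference numerals 401-404 as enum values is purely an illustrative choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Level(Enum):
    # Values mirror the reference numerals of FIG. 4 for readability only.
    SENTENCE = 401
    PARAGRAPH = 402
    SECTION = 403
    DOCUMENT = 404


@dataclass
class ClauseRecord:
    record_id: int                      # unique identification number
    level: Level                        # index level of this data set
    text: str                           # actual text of the clause
    position: int                       # assumed: character offset in the document
    feature_text: Optional[str] = None  # feature replaced text, if available


records = [
    ClauseRecord(1, Level.DOCUMENT,
                 "Payment is due within 30 days. Late fees apply.", 0),
    ClauseRecord(2, Level.SENTENCE, "Payment is due within 30 days.", 0,
                 "Payment is due within <DURATION>."),
    ClauseRecord(3, Level.SENTENCE, "Late fees apply.", 31),
]


def at_level(items: List[ClauseRecord], level: Level) -> List[ClauseRecord]:
    """Return the records stored at one index level."""
    return [r for r in items if r.level is level]
```

Calling `at_level(records, Level.SENTENCE)` returns the two sentence-level records, and each record's `position` locates it inside the document-level text.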

During the process of defining policies and determining the non-standard and standard exact clauses, the discovery engine 120 and the analysis engine 130 communicate frequently with the discovery database 160 and the analysis database 150, which serve as the core processing repository and metadata storage location. In one exemplary embodiment, both databases contain information related to policies and the analysis database 150 may reside on the same hardware as the discovery database 160. However, the data structures in the analysis database 150 provide for two differing data sets for each sentence, paragraph, and section: one for exact text and another for features. Therefore, the storage requirement of the analysis database 150 may be demanding, but the analysis engine 130 can achieve advanced functionality including feature replacements. To reduce the extra storage requirement, the analysis database 150 may use a pointer, instead of creating copies of the entire data set.
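One way to picture the pointer-based storage is shown below. This is a sketch under stated assumptions, not the disclosed data layout: the exact text is stored once, and the feature data set holds only a reference to it plus the spans to be replaced. The names `FeatureView`, `exact_texts`, `feature_views` and `materialize`, the `<DURATION>` feature label, and the span representation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FeatureView:
    text_id: int                              # pointer to the exact text entry
    replacements: List[Tuple[int, int, str]]  # (start, end, feature label)


# Exact text data set: each clause text is stored once.
exact_texts: Dict[int, str] = {
    7: "Payment is due within 30 days of invoice.",
}

# Feature data set: references the exact text instead of copying it.
feature_views: Dict[int, FeatureView] = {
    7: FeatureView(text_id=7, replacements=[(22, 29, "<DURATION>")]),
}


def materialize(view: FeatureView, texts: Dict[int, str]) -> str:
    """Build the feature replaced text on demand from the shared exact text."""
    source = texts[view.text_id]
    parts, cursor = [], 0
    for start, end, feature in sorted(view.replacements):
        parts.append(source[cursor:start])  # unchanged text before the span
        parts.append(feature)               # feature label replaces the span
        cursor = end
    parts.append(source[cursor:])           # unchanged tail
    return "".join(parts)
```

Here `materialize(feature_views[7], exact_texts)` yields "Payment is due within <DURATION> of invoice." without a second full copy of the clause ever being stored. The trade-off is extra computation at read time in exchange for the reduced storage.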

Computing Machine Architecture

FIG. 8 is a block diagram illustrating components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller). Specifically, FIG. 8 shows a diagrammatic representation of a machine in the example form of a computer system 800 within which instructions 824 (e.g., software or program code) for causing the machine to perform (execute) any one or more of the methodologies described with FIGS. 1-7 may be executed. That is, the methodologies illustrated and described through FIGS. 1-7 can be embodied as instructions 824 that are stored and executable by the computer system 800. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions 824 (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute instructions 824 to perform any one or more of the methodologies discussed herein.

The example computer system 800 includes a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these), a main memory 804, and a static memory 806, which are configured to communicate with each other via a bus 808. The processing components are the processor 802 and memory 804. These components can be configured to operate the engines or modules with the instructions that correspond with the functionality of the respective engines or modules. The computer system 800 may further include a graphics display unit 810 (e.g., a plasma display panel (PDP), a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)). The computer system 800 may also include an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 816, a signal generation device 818 (e.g., a speaker), and a network interface device 820, which also are configured to communicate via the bus 808.

The storage unit 816 includes a machine-readable medium 822 on which are stored instructions 824 (e.g., software) embodying any one or more of the methodologies or functions described herein. The storage unit 816 may be implemented as volatile memory (static RAM (SRAM) or dynamic RAM (DRAM)) and/or non-volatile memory (read-only memory (ROM), flash memory, magnetic computer storage devices (e.g., hard disks, floppy discs and magnetic tape), optical discs, etc.). The instructions 824 (e.g., software) may also reside, completely or at least partially, within the main memory 804 or within the processor 802 (e.g., within a processor's cache memory) during execution thereof by the computer system 800, the main memory 804 and the processor 802 also constituting machine-readable media. The instructions 824 (e.g., software) may be transmitted or received over a network 826 via the network interface device 820.

While machine-readable medium 822 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 824). The term “machine-readable medium” shall also be taken to include any medium that is capable of storing instructions (e.g., instructions 824) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein. The term “machine-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.

Additional Configuration Considerations

It is noted that although the configurations as disclosed are in the context of contracts, the principles disclosed can apply to analysis of other documents that can include data corresponding to standard exact clauses and non-standard clauses. Advantages of the disclosed configurations include promptly identifying (i) exact clauses, (ii) semantically related terminologies and (iii) unusual variations of the semantically related terminologies in a large volume of documents. Moreover, while the examples herein are in the context of a contract document, the principles described herein can apply to other documents, for example web pages, having various clauses.
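The separation of exact clauses from unusual variations rests on the difference between the mirror feature data set and the standard exact feature data set. A minimal sketch of that difference, assuming each clause is represented by a hashable identifier, is given below; the function name and the sample identifiers are hypothetical, and the subset check simply encodes the stated property that the mirror feature data set encompasses the standard exact feature data set.

```python
from typing import Iterable, Set, Tuple


def split_standard_and_non_standard(
    mirror_ids: Iterable[str],
    standard_exact_ids: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (standard exact clauses, non-standard clauses) by set difference."""
    mirror = set(mirror_ids)
    standard = set(standard_exact_ids)
    if not standard <= mirror:
        raise ValueError(
            "mirror set is assumed to encompass the standard exact set"
        )
    # Non-standard clauses are semantically related but not exact matches.
    return standard, mirror - standard


# Illustrative identifiers only.
mirror_feature_set = {"c1", "c2", "c3", "c4"}
standard_exact_set = {"c1", "c3"}
standard, non_standard = split_standard_and_non_standard(
    mirror_feature_set, standard_exact_set
)
```

With these sample inputs, `standard` holds `c1` and `c3`, while `non_standard` holds `c2` and `c4`. The same operation applies to any document type whose clauses can be assigned stable identifiers.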

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Certain embodiments are described herein as including logic or a number of components, engines, modules, or mechanisms, for example, as illustrated in FIGS. 1-7. Modules may constitute either software modules (e.g., program code embodied as instructions 824 stored on a machine-readable medium, e.g., memory 804 and/or storage unit 816, and executable by one or more processors (e.g., processor 802)) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors (generally, e.g., processor 802)) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

The various operations of example methods described herein may be performed, at least partially, by one or more processors, e.g., processor 802, that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented modules.

The one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., application program interfaces (APIs)).

The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.

Some portions of this specification are presented in terms of algorithms or symbolic representations of operations on data stored as bits or binary digital signals within a machine memory (e.g., a computer memory 804). These algorithms or symbolic representations are examples of techniques used by those of ordinary skill in the data processing arts to convey the substance of their work to others skilled in the art. As used herein, an “algorithm” is a self-consistent sequence of operations or similar processing leading to a desired result. In this context, algorithms and operations involve physical manipulation of physical quantities. Typically, but not necessarily, such quantities may take the form of electrical, magnetic, or optical signals capable of being stored, accessed, transferred, combined, compared, or otherwise manipulated by a machine. It is convenient at times, principally for reasons of common usage, to refer to such signals using words such as “data,” “content,” “bits,” “values,” “elements,” “symbols,” “characters,” “terms,” “numbers,” “numerals,” or the like. These words, however, are merely convenient labels and are to be associated with appropriate physical quantities.

Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or a combination thereof), registers, or other machine components that receive, store, transmit, or display information.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. For example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, use of “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for detecting standard exact clauses and non-standard clauses through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

What is claimed is:
 1. A non-transitory computer readable medium storing program code for determining a presence of a type of clause within a plurality of documents, the program code comprising instructions that when executed by a processor cause the processor to: receive a clause example corresponding to the type of clause; generate a primary policy based upon the received clause example for use in a semantic language evaluator configured to assess a level of semantic similarity between received clauses, the primary policy comprising one or more policy rules and associated with a first threshold value indicating a level of semantic similarity of a clause to the clause example; analyze, using the semantic language evaluator, the plurality of documents according to the primary policy to automatically provide a first set of clauses corresponding of the plurality of documents, each clause of the first set corresponding to a standard clause matching the clause example in accordance with the first threshold; generate a mirror document based upon the plurality of documents by automatically replacing one or more portions of the plurality of documents having allowable variations with corresponding variables; parse the mirror document to generate a second set of clauses corresponding to a standard exact feature data set; generate a secondary policy based upon the primary policy and the clause example for use in the semantic language evaluator, the secondary policy associated with a second threshold value indicating a level of semantic similarity of a clause to the clause example that is lower than the first threshold value; analyze, using the semantic language evaluator, the plurality of documents according to the secondary policy to automatically provide a third set of clauses comprising non-standard clauses semantically related to but not matching the clause example in accordance with the second threshold, wherein the third set of clauses corresponds to a mirror feature data set; obtain a difference between the mirror feature data set and the standard exact feature data set, the difference corresponding to non-standard clauses of the plurality of documents; update, automatically, a database to identify the standard and non-standard clauses of the plurality of documents associated with the type of clause based upon the obtained difference, for subsequent usage in analyzing the plurality of documents.
 2. The non-transitory computer readable medium of claim 1, further comprising instructions when executed by the processor cause the processor to: receive one or more features associated with the type of clause; and generate, using a semantic language evaluator, a plurality of feature replaced clauses by automatically replacing one or more of a plurality of original clauses in the plurality of documents with the one or more features.
 3. The non-transitory computer readable medium of claim 1, further comprising instructions when executed by the processor cause the processor to: identify a portion of the clause example as corresponding to an available variation of the clause example; and replace the available variation with a variable.
 4. The non-transitory computer readable medium of claim 3, further comprising instructions when executed by the processor cause the processor to: parse the plurality of documents to generate the second set of clauses corresponding to the standard exact feature data set containing clauses matching the clause example based upon the available variation.
 5. The non-transitory computer readable medium of claim 1, further comprising instructions when executed by the processor cause the processor to: replace one or more clauses of the plurality of documents with one or more features, each feature of the one or more features corresponding to a reference or description of a portion of the plurality of documents, wherein the first policy is generated based upon at least one feature of the one or more features.
 6. The non-transitory computer readable medium of claim 1, further comprising instructions when executed by the processor cause the processor to: obtain a difference between the first set of clauses and the third set of clauses corresponding to one or more non-standard clauses.
 7. A computer implemented method for determining a presence of a type of clause within a plurality of documents, the method comprising: receiving a clause example corresponding to the type of clause; generating a primary policy based upon the received clause example for use in a semantic language evaluator configured to assess a level of semantic similarity between received clauses, the primary policy comprising one or more policy rules and associated with a first threshold value indicating a level of semantic similarity of a clause to the clause example; analyzing, using the semantic language evaluator, the plurality of documents according to the primary policy to automatically provide a first set of clauses corresponding of the plurality of documents, each clause of the first set corresponding to a standard clause matching the clause example in accordance with the first threshold; generating a mirror document based upon the plurality of documents by automatically replacing one or more portions of the plurality of documents having allowable variations with corresponding variables; parsing the mirror document to generate a second set of clauses corresponding to a standard exact feature data set; generating a secondary policy based upon the primary policy and the clause example for use in the semantic language evaluator, the secondary policy associated with a second threshold value indicating a level of semantic similarity of a clause to the clause example that is lower than the first threshold value; analyzing, using the semantic language evaluator, the plurality of documents according to the secondary policy to automatically provide a third set of clauses comprising non-standard clauses semantically related to but not matching the clause example in accordance with the second threshold, wherein the third set of clauses corresponds to a mirror feature data set; obtaining a difference between the mirror feature data set and the standard exact feature data set, the difference corresponding to non-standard clauses of the plurality of documents; and automatically updating a database to identify the standard and non-standard clauses of the plurality of documents associated with the type of clause based upon the obtained difference, for subsequent usage in analyzing the plurality of documents.
 8. The method of claim 7, further comprising: receiving one or more features associated with the type of clause; and generating, using a semantic language evaluator, a plurality of feature replaced clauses by automatically replacing one or more of a plurality of original clauses in the plurality of documents with the one or more features.
 9. The method of claim 7, further comprising: identifying a portion of the clause example as corresponding to an available variation of the clause example; and replacing the available variation with a variable.
 10. The method of claim 9, further comprising: parsing the plurality of documents to generate the second set of clauses corresponding to the standard exact feature data set containing clauses matching the clause example based upon the available variation.
 11. The method of claim 7, further comprising: replacing one or more clauses of the plurality of documents with one or more features, each feature of the one or more features corresponding to a reference or description of a portion of the plurality of documents, wherein the first policy is generated based upon at least one feature of the one or more features.
 12. The method of claim 7, further comprising: obtain a difference between the first set of clauses and the third set of clauses corresponding to one or more non-standard clauses.
 13. A system for determining a presence of a type of clause within a plurality of documents, comprising: a document parsing module configured to receive a clause example corresponding to the type of clause; a policy definition module configured to: generate a primary policy based upon the received clause example for use in a semantic language evaluator configured to assess a level of semantic similarity between received clauses, the primary policy comprising one or more policy rules and associated with a first threshold value indicating a level of semantic similarity of a clause to the clause example; and generate a secondary policy based upon the primary policy and the clause example for use in the semantic language evaluator, the secondary policy associated with a second threshold value indicating a level of semantic similarity of a clause to the clause example that is lower than the first threshold value; an analysis engine configured to: analyze, using the semantic language evaluator, the plurality of documents according to the primary policy to automatically provide a first set of clauses corresponding of the plurality of documents, each clause of the first set corresponding to a standard clause matching the clause example in accordance with the first threshold; generate a mirror document based upon the plurality of documents by automatically replacing one or more portions of the plurality of documents having allowable variations with corresponding variables; parse the mirror document to generate a second set of clauses corresponding to a standard exact feature data set; analyze, using the semantic language evaluator, the plurality of documents according to the secondary policy to automatically provide a third set of clauses comprising non-standard clauses semantically related to but not matching the clause example in accordance with the second threshold, wherein the third set of clauses corresponds to a mirror feature data set; obtain a difference between the mirror feature data set and the standard exact feature data set, the difference corresponding to non-standard clauses of the plurality of documents; and update, automatically, a database to identify the standard and non-standard clauses of the plurality of documents associated with the type of clause based upon the obtained difference, for subsequent usage in analyzing the plurality of documents.
 14. The system of claim 13, wherein the document parsing module is further configured to: receive one or more features associated with the type of clause; and generate, using a semantic language evaluator, a plurality of feature replaced clauses by automatically replacing one or more of a plurality of original clauses in the plurality of documents with the one or more features.
 15. The system of claim 13, wherein the document parsing module is further configured to: identify a portion of the clause example as corresponding to an available variation of the clause example; and replace the available variation with a variable.
 16. The system of claim 15, wherein the analysis engine is further configured to: parse the plurality of documents to generate the second set of clauses corresponding to the standard exact feature data set containing clauses matching the clause example based upon the available variation.
 17. The system of claim 13, wherein the document parsing module is further configured to: replace one or more clauses of the plurality of documents with one or more features, each feature of the one or more features corresponding to a reference or description of a portion of the plurality of documents, and wherein the policy definition module is configured to generate the first policy based upon at least one feature of the one or more features.
 18. The system of claim 13, wherein the analysis engine is further configured to: obtain a difference between the first set of clauses and the third set of clauses corresponding to one or more non-standard clauses.
 19. A non-transitory computer readable medium comprising stored program code for determining a presence of a type of clause within a plurality of documents, the program code comprising instructions that when executed by a processor cause the processor to: receive a clause example corresponding to the type of clause; generate a primary policy based upon the received clause example for use in a semantic language evaluator configured to assess a level of semantic similarity between received clauses, the primary policy comprising one or more policy rules and associated with a first threshold value indicating a level of semantic similarity of a clause to the clause example; analyze, using the semantic language evaluator, the plurality of documents according to the primary policy to automatically provide a first set of clauses of the plurality of documents, each clause of the first set corresponding to a standard clause matching the clause example in accordance with the first threshold; generate a mirror document based upon the plurality of documents by automatically replacing one or more portions of the plurality of documents having allowable variations with corresponding variables; generate a secondary policy based upon the primary policy and the clause example for use in the semantic language evaluator, the secondary policy associated with a second threshold value indicating a level of semantic similarity of a clause to the clause example that is lower than the first threshold value; analyze, using the semantic language evaluator, the mirror document according to the secondary policy to automatically provide a second set of clauses corresponding to a mirror feature data set comprising non-standard clauses semantically related to but not matching the clause example in accordance with the second threshold; obtain a difference between the mirror feature data set and the first set of clauses, the difference corresponding to non-standard clauses of the plurality of documents; and update, automatically, a database to identify the standard and non-standard clauses of the plurality of documents associated with the type of clause based upon the obtained difference, for subsequent usage in analyzing the plurality of documents.
 20. The non-transitory computer readable medium of claim 19, further comprising instructions that when executed by the processor cause the processor to: parse the mirror document to generate a set of standard exact clauses corresponding to a standard exact feature data set based on the clause example; and update, automatically, the database to identify the standard exact clauses of the plurality of documents.